Candidate name: Yash Kamdar

Contact: yashk9516@gmail.com

Please reach out at the email address above if you encounter any problems while running the script.

Problem Statement

You are consulting for a real estate company whose business model centers on purchasing properties in New York City to rent out short-term. The company has already concluded that two-bedroom properties are the most profitable; however, it does not know which zip codes are best to invest in.

Objective

Identify zip codes that would generate the most profit on short term rentals within New York City.

Data: Zillow and AirBnB

Assumptions:

  • 2-bedroom properties are the most profitable (as identified by the client)

  • An occupancy rate of 75% is assumed. Further, each occupancy is assumed to be a one-night stay.

  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).

  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).

  • All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
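A quick sketch of what the occupancy assumption implies for yearly revenue (the 75% rate and one-night stays come from the assumptions above; the example nightly price is hypothetical):

```python
OCCUPANCY = 0.75        # assumed occupancy rate
NIGHTS_PER_YEAR = 365   # one-night stays, so occupied nights equal revenue nights

def annual_revenue(nightly_price, occupancy=OCCUPANCY, nights=NIGHTS_PER_YEAR):
    """Expected yearly rental revenue under the one-night-stay assumption."""
    return nightly_price * nights * occupancy

# e.g. a hypothetical $200/night listing
print(annual_revenue(200))  # 54750.0
```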

New Assumptions

  • The median of the monthly median property prices for a RegionName (zipcode) correctly reflects the median market cost of property in that zip code

  • For 2017, data is available for only six months, so the median over those six months is taken as the 2017 median price.

  • CAGR (Compound Annual Growth Rate) would be an appropriate indicator to forecast property and rent prices.

  • 2020 median property costs are best forecast by applying the 3-year CAGR to the 2017 median property costs.

  • It has been given that all properties and all square feet within each locale can be assumed to be homogeneous. Hence, each zipcode takes the 2020 median price of its neighbourhood group as its median property price. I have assumed that the per-square-foot property cost for a neighbourhood is the 2020 median price divided by the median square footage of properties in that neighbourhood.

  • The rent per night (price) for 2020 increases at the same rate as property values (3-year CAGR) for that zip code from 2017 to 2020. If the CAGR is negative, rent remains unchanged.

  • Since the break-even period spans many years, I've used a 5-year CAGR, computed from the 2012–2017 change in property values in the Zillow file, as the rate at which rent increases in the years following 2020. Rent is charged per night, with 75% occupancy assumed per year for all neighbourhoods and zip codes.
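The CAGR-based forecasting described in these assumptions can be sketched as follows (the dollar figures are hypothetical; the negative-rate clamp applies to rent, per the assumption above):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

def forecast_rent(rent, rate, years):
    """Grow rent forward; a negative CAGR leaves rent unchanged, per the assumption."""
    return rent * (1 + max(rate, 0.0)) ** years

# Hypothetical 2014 -> 2017 medians give a 3-year CAGR...
rate = cagr(500_000, 600_000, 3)
# ...which projects a hypothetical $150/night rent three years forward
rent_2020 = forecast_rent(150, rate, 3)
print(round(rent_2020, 2))  # 180.0
```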

Quality check:

Airbnb:

  • Remove unwanted features
  • Varied names for New York City
  • Price column has special characters such as "$" and "," in them
  • One property has a rent of 0.
  • Some zipcodes are not formatted correctly. I tried to fill null values in the zipcode column using an external library, uszipcode, which returns the zipcode of a location given its latitude and longitude. However, upon inspecting the values returned by the library against the ones present in the dataset, there was not a consistent match, so rows without a zipcode were dropped.
  • The square_feet column is mostly missing.
  • The cleaning_fee column has missing values
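The zipcode format check described above can be sketched with a simple regex filter (an assumed approach, not the project's exact code):

```python
import re

def is_valid_zip(value):
    """A well-formed NYC zipcode is exactly five digits."""
    return bool(re.fullmatch(r"\d{5}", str(value)))

print([is_valid_zip(z) for z in ["10001", "100", "11201-3422", None]])
# [True, False, False, False]
```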

Zillow

  • Monthly median property costs from 1996 to 2006 have many missing fields
  • All monthly median cost columns had datetime objects as column names; these were converted to strings
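The column-name fix for the Zillow file can be illustrated on a toy frame (a sketch; the column label and value are hypothetical):

```python
import pandas as pd

# Zillow's monthly columns arrive with datetime labels
df = pd.DataFrame({pd.Timestamp("2017-06-01"): [352000]})
# Convert every column label to a plain string
df.columns = df.columns.map(str)
print(df.columns.tolist())  # ['2017-06-01 00:00:00']
```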
In [1]:
!pip install handcalcs
!pip install "ipywidgets>=7.0.0"
!pip install folium
!pip install plotly_express==0.4.0
Requirement already satisfied: handcalcs in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (1.3.0)
Requirement already satisfied: nbconvert<6.0.0,>=5.6.1 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from handcalcs) (5.6.1)
Requirement already satisfied: pyparsing<3.0.0,>=2.4.7 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from handcalcs) (2.4.7)
Requirement already satisfied: more-itertools<9.0.0,>=8.5.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from handcalcs) (8.7.0)
Requirement already satisfied: innerscope<0.3.0,>=0.2.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from handcalcs) (0.2.0)
Requirement already satisfied: toolz in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from innerscope<0.3.0,>=0.2.0->handcalcs) (0.11.1)
Requirement already satisfied: nbformat>=4.4 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (5.0.8)
Requirement already satisfied: bleach in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (3.2.1)
Requirement already satisfied: pandocfilters>=1.4.1 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (1.4.3)
Requirement already satisfied: jupyter-core in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (4.7.0)
Requirement already satisfied: jinja2>=2.4 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (2.11.2)
Requirement already satisfied: traitlets>=4.2 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (5.0.5)
Requirement already satisfied: mistune<2,>=0.8.1 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (0.8.4)
Requirement already satisfied: entrypoints>=0.2.2 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (0.3)
Requirement already satisfied: pygments in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (2.7.3)
Requirement already satisfied: defusedxml in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (0.6.0)
Requirement already satisfied: testpath in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbconvert<6.0.0,>=5.6.1->handcalcs) (0.4.4)
Requirement already satisfied: webencodings in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from bleach->nbconvert<6.0.0,>=5.6.1->handcalcs) (0.5.1)
Requirement already satisfied: packaging in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from bleach->nbconvert<6.0.0,>=5.6.1->handcalcs) (20.8)
Requirement already satisfied: six>=1.9.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from bleach->nbconvert<6.0.0,>=5.6.1->handcalcs) (1.15.0)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from jinja2>=2.4->nbconvert<6.0.0,>=5.6.1->handcalcs) (1.1.1)
Requirement already satisfied: pywin32>=1.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from jupyter-core->nbconvert<6.0.0,>=5.6.1->handcalcs) (227)
Requirement already satisfied: ipython-genutils in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbformat>=4.4->nbconvert<6.0.0,>=5.6.1->handcalcs) (0.2.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from nbformat>=4.4->nbconvert<6.0.0,>=5.6.1->handcalcs) (3.2.0)
Requirement already satisfied: setuptools in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert<6.0.0,>=5.6.1->handcalcs) (51.0.0.post20201207)
Requirement already satisfied: attrs>=17.4.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert<6.0.0,>=5.6.1->handcalcs) (20.3.0)
Requirement already satisfied: pyrsistent>=0.14.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert<6.0.0,>=5.6.1->handcalcs) (0.17.3)
Requirement already satisfied: folium in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (0.12.1)
Requirement already satisfied: branca>=0.3.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from folium) (0.4.2)
Requirement already satisfied: requests in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from folium) (2.25.1)
Requirement already satisfied: numpy in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from folium) (1.19.2)
Requirement already satisfied: jinja2>=2.9 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from folium) (2.11.2)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from jinja2>=2.9->folium) (1.1.1)
Requirement already satisfied: idna<3,>=2.5 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from requests->folium) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from requests->folium) (1.26.3)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from requests->folium) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from requests->folium) (2020.12.5)
Requirement already satisfied: plotly_express==0.4.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (0.4.0)
Requirement already satisfied: statsmodels>=0.9.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly_express==0.4.0) (0.12.1)
Requirement already satisfied: scipy>=0.18 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly_express==0.4.0) (1.5.2)
Requirement already satisfied: patsy>=0.5 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly_express==0.4.0) (0.5.1)
Requirement already satisfied: plotly>=4.0.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly_express==0.4.0) (4.14.3)
Requirement already satisfied: pandas>=0.20.0 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly_express==0.4.0) (1.1.5)
Requirement already satisfied: numpy>=1.11 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly_express==0.4.0) (1.19.2)
Requirement already satisfied: pytz>=2017.2 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from pandas>=0.20.0->plotly_express==0.4.0) (2020.4)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from pandas>=0.20.0->plotly_express==0.4.0) (2.8.1)
Requirement already satisfied: six in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from patsy>=0.5->plotly_express==0.4.0) (1.15.0)
Requirement already satisfied: retrying>=1.3.3 in c:\users\ykamd\.conda\envs\ve1\lib\site-packages (from plotly>=4.0.0->plotly_express==0.4.0) (1.3.3)
In [2]:
#Custom preprocessing modules
from preprocessing import * 
from kpis import *

#Python dependencies
import gzip
import re
import math
import urllib.request, json 
import warnings
warnings.filterwarnings('ignore')

#Data processing libraries
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 150)
import numpy as np

#Data Visualization libraries
import seaborn as sns
from ipywidgets import widgets
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import handcalcs.render #To render Metrics equations in the form of handwritten calculations


#Imports for ML model
from sklearn.model_selection import (train_test_split, cross_val_score, cross_validate,
                                     GridSearchCV, StratifiedKFold)
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, PolynomialFeatures,
                                   KBinsDiscretizer, OneHotEncoder)
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
In [3]:
with gzip.open('listings.csv.gz') as f:
    airbnb = pd.read_csv(f)

In [4]:
#Keeping only important features
subset = ['last_scraped','neighbourhood_group_cleansed','city', 'state', 'zipcode', 'is_location_exact',\
         'market','latitude', 'longitude', 'property_type', 'room_type','accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'square_feet',\
        'price', 'cleaning_fee','extra_people', 'availability_30', 'availability_60', 'availability_90', 
        'availability_365', 'number_of_reviews']
ab = airbnb[subset]
In [5]:
#Dictionary storing all variants of the New York name, used to map them to a canonical form
di = {"Ny": "NY", "ny": "NY", "New York ": "NY"}
ab.replace({"state" : di}, inplace = True)
ab = ab[ab.state == 'NY']
In [6]:
ab = city_cleaner(ab, 'city')
In [7]:
# Selecting only 2 bedroom properties in New York City
ab = ab[(ab.bedrooms==2.0) & (ab.state=="NY") & ((ab.city == "New York City") | (ab.market== "New York"))]
In [8]:
from preprocessing import numeric_cleaner
In [9]:
#Stored currency related columns in an array
currency = ['price', 'extra_people', 'cleaning_fee']
#Cleaning all currency related column
ab = numeric_cleaner(ab, currency)
successfully cleaned price
successfully cleaned extra_people
successfully cleaned cleaning_fee
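`numeric_cleaner` lives in the custom `preprocessing` module, which isn't shown here; below is a minimal sketch of the kind of cleaning it performs (an assumption, not the actual implementation):

```python
import pandas as pd

def clean_currency(df, columns):
    """Strip '$' and ',' from currency strings and cast the column to float."""
    for col in columns:
        df[col] = (df[col].astype(str)
                          .str.replace(r"[$,]", "", regex=True)
                          .astype(float))
        print(f"successfully cleaned {col}")
    return df

demo = pd.DataFrame({"price": ["$1,250.00", "$85.00"]})
demo = clean_currency(demo, ["price"])
print(demo["price"].tolist())  # [1250.0, 85.0]
```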
In [10]:
#One row has a price per night of 0.0, which is a data discrepancy
#Use boolean masking to drop rows where price is 0
mask = ab[ab.price==0] 
ab.drop(mask.index, inplace = True)
In [11]:
ab['price'].describe(percentiles=[0.25,0.50,0.75,0.95,0.975])
Out[11]:
count     6515.000000
mean       230.304528
std        384.143701
min         10.000000
25%        125.000000
50%        175.000000
75%        250.000000
95%        490.000000
97.5%      650.000000
max      10000.000000
Name: price, dtype: float64
In [12]:
px.box(ab, y = 'price', width = 450., title = 'Variation in Rent with Outliers' )

From the statistics and the boxplot above, it is clear that prices for a few properties are far higher than the rest. We shall remove such properties from our dataset.

In [13]:
#Removing outliers (2.5% of data)
ab = ab[ab['price']<600]
In [14]:
px.box(ab, y = 'price', width = 450., title = 'Variation in Rent after removing Outliers' )

After removing the outliers in the price column, the box plot now describes some acceptable statistics.

In [15]:
# Checking how many zipcode variants are present by counting zipcode lengths
from collections import defaultdict
d = defaultdict(int)
for i in ab['zipcode']:
    d[len(str(i))]+=1
d
Out[15]:
defaultdict(int, {5: 6257, 3: 35, 6: 1, 8: 34})
In [16]:
#Dropped rows where zipcode is not available. Tried to fill null zipcodes with the third-party package
#uszipcode, but manual inspection showed its values did not consistently match those in the dataset
ab = ab[ab['zipcode'].isna()==False]
ab = ab[(ab.zipcode.str.len() == 5)]
In [17]:
#We need to fill missing values in Cleaning_fee column based on Neighbourhood_group_cleansed and property_type
compare_stats = ab.groupby(['neighbourhood_group_cleansed', 'property_type'])['cleaning_fee'].describe(percentiles = [0.25,0.50,0.75,0.95])
compare_stats 
Out[17]:
count mean std min 25% 50% 75% 95% max
neighbourhood_group_cleansed property_type
Bronx Apartment 59.0 70.033898 38.884873 0.0 42.50 60.0 100.00 150.00 179.0
Casa particular (Cuba) 1.0 80.000000 NaN 80.0 80.00 80.0 80.00 80.00 80.0
Condominium 5.0 83.000000 58.694122 10.0 35.00 100.0 120.00 144.00 150.0
Cottage 1.0 0.000000 NaN 0.0 0.00 0.0 0.00 0.00 0.0
Dome house 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
Guest suite 6.0 74.000000 61.789967 0.0 35.00 69.5 97.25 156.25 175.0
House 21.0 82.238095 49.909823 0.0 50.00 95.0 125.00 150.00 150.0
Loft 2.0 75.000000 70.710678 25.0 50.00 75.0 100.00 120.00 125.0
Townhouse 7.0 62.714286 44.006493 15.0 40.00 50.0 72.50 129.80 149.0
Villa 1.0 275.000000 NaN 275.0 275.00 275.0 275.00 275.00 275.0
Brooklyn Apartment 1845.0 85.001084 47.443722 0.0 50.00 80.0 100.00 150.00 600.0
Boutique hotel 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
Castle 1.0 150.000000 NaN 150.0 150.00 150.0 150.00 150.00 150.0
Cave 1.0 65.000000 NaN 65.0 65.00 65.0 65.00 65.00 65.0
Condominium 111.0 105.018018 49.489757 0.0 80.00 100.0 130.00 200.00 300.0
Guest suite 31.0 69.967742 35.951341 0.0 47.50 80.0 98.50 125.00 125.0
Guesthouse 9.0 86.666667 31.819805 30.0 75.00 80.0 100.00 130.00 150.0
Hotel 2.0 60.000000 14.142136 50.0 55.00 60.0 65.00 69.00 70.0
House 167.0 86.053892 52.199370 0.0 50.00 85.0 120.00 178.50 250.0
Loft 112.0 94.750000 46.742249 0.0 63.75 100.0 130.00 152.25 200.0
Other 2.0 82.500000 3.535534 80.0 81.25 82.5 83.75 84.75 85.0
Serviced apartment 1.0 410.000000 NaN 410.0 410.00 410.0 410.00 410.00 410.0
Tiny house 1.0 30.000000 NaN 30.0 30.00 30.0 30.00 30.00 30.0
Townhouse 144.0 96.375000 49.704846 0.0 60.00 92.5 125.00 199.25 250.0
Manhattan Aparthotel 2.0 50.000000 70.710678 0.0 25.00 50.0 75.00 95.00 100.0
Apartment 1931.0 114.107198 66.264124 0.0 75.00 100.0 150.00 250.00 500.0
Boutique hotel 8.0 33.750000 37.008686 0.0 0.00 25.0 70.00 76.50 80.0
Condominium 121.0 113.380165 51.822495 0.0 99.00 100.0 150.00 200.00 250.0
Guest suite 4.0 110.000000 63.770422 50.0 80.00 95.0 125.00 185.00 200.0
Guesthouse 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
Hotel 2.0 0.000000 0.000000 0.0 0.00 0.0 0.00 0.00 0.0
House 12.0 130.750000 71.072466 25.0 87.50 135.0 181.25 222.50 250.0
Loft 64.0 129.781250 72.719611 0.0 91.25 122.5 150.00 248.50 395.0
Other 2.0 80.000000 28.284271 60.0 70.00 80.0 90.00 98.00 100.0
Resort 1.0 100.000000 NaN 100.0 100.00 100.0 100.00 100.00 100.0
Serviced apartment 58.0 184.810345 125.416423 0.0 100.00 187.5 250.00 410.00 410.0
Townhouse 29.0 100.862069 55.446708 0.0 80.00 105.0 150.00 158.40 200.0
Villa 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
Queens Apartment 314.0 76.866242 45.945181 0.0 50.00 75.0 100.00 150.00 350.0
Bed and breakfast 1.0 60.000000 NaN 60.0 60.00 60.0 60.00 60.00 60.0
Bungalow 5.0 61.800000 16.619266 49.0 50.00 50.0 80.00 80.00 80.0
Cabin 1.0 0.000000 NaN 0.0 0.00 0.0 0.00 0.00 0.0
Camper/RV 3.0 51.666667 7.637626 45.0 47.50 50.0 55.00 59.00 60.0
Condominium 31.0 77.129032 38.483539 0.0 65.00 75.0 100.00 125.00 200.0
Cottage 2.0 75.000000 35.355339 50.0 62.50 75.0 87.50 97.50 100.0
Farm stay 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
Guest suite 13.0 72.307692 44.422938 25.0 30.00 60.0 100.00 150.00 150.0
Guesthouse 3.0 43.333333 45.092498 0.0 20.00 40.0 65.00 85.00 90.0
House 106.0 72.254717 46.508998 0.0 41.25 70.0 96.75 150.00 280.0
Lighthouse 1.0 98.000000 NaN 98.0 98.00 98.0 98.00 98.00 98.0
Loft 5.0 103.600000 57.743398 50.0 70.00 99.0 99.00 179.80 200.0
Other 1.0 35.000000 NaN 35.0 35.00 35.0 35.00 35.00 35.0
Serviced apartment 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
Townhouse 16.0 96.312500 47.089587 50.0 69.00 89.5 100.00 163.75 250.0
Staten Island Apartment 24.0 63.333333 23.311999 0.0 50.00 60.0 76.25 100.00 100.0
Guest suite 5.0 43.800000 13.863621 19.0 50.00 50.0 50.00 50.00 50.0
House 19.0 55.473684 31.346395 0.0 40.00 50.0 85.00 98.00 125.0
Loft 1.0 100.000000 NaN 100.0 100.00 100.0 100.00 100.00 100.0
In [18]:
compare_stats = compare_stats.reset_index()
In [19]:
compare_stats_bronx = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Bronx']
compare_stats_manhattan = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Manhattan']
compare_stats_Queens = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Queens']
compare_stats_StatenIsland = compare_stats[compare_stats['neighbourhood_group_cleansed'] == 'Staten Island']
In [20]:
from plotly.subplots import make_subplots
neighborhoods = [compare_stats_bronx, compare_stats_manhattan, compare_stats_Queens, compare_stats_StatenIsland]
name = ['Bronx', 'Manhattan', 'Queens', 'Staten Island']
fig = make_subplots(rows=2, cols=2, vertical_spacing = 0.30, horizontal_spacing = 0.20, row_heights =[800, 800],\
                   subplot_titles = ['{} Cleaning fee trend (95%ile and max value)'.format(title) for title in name],
)


for index, df in enumerate(neighborhoods):
    # Map the flat index onto the 2x2 subplot grid
    r, c = divmod(index, 2)
    fig.add_trace(go.Scatter(x=df['property_type'], y=df['95%']), row=r + 1, col=c + 1)
    fig.add_trace(go.Scatter(x=df['property_type'], y=df['max']), row=r + 1, col=c + 1)
fig.update_layout(height=800)
    
fig.show()

The four plots show the 95th-percentile and maximum cleaning fees for the Bronx, Manhattan, Queens and Staten Island neighborhoods. The cleaning_fee column contains 847 missing values, and these statistics help us decide whether to fill the nulls with the mean or the median cleaning fee of the respective neighborhood and property type. We can see stark differences in the following observations:

  • Brooklyn: for Apartment properties, the maximum cleaning fee is 600, while 95% of apartments have a cleaning fee of 150 or less.
  • Manhattan: for Apartment properties, the maximum cleaning fee is 500, while 95% of apartments have a cleaning fee of 250 or less.
In [21]:
#Filling null values in the cleaning_fee column with the median cleaning_fee per neighborhood and property type
ab["cleaning_fee"] = ab.groupby(['neighbourhood_group_cleansed', 'property_type'])['cleaning_fee'].transform(lambda x: x.fillna(x.median()))
ab = ab[ab['cleaning_fee'].notna()]
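The transform-based imputation above can be illustrated on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "fee":   [10.0, None, 30.0, 50.0, None],
})
# Each missing fee is filled with the median of its own group
toy["fee"] = toy.groupby("group")["fee"].transform(lambda s: s.fillna(s.median()))
print(toy["fee"].tolist())  # [10.0, 20.0, 30.0, 50.0, 50.0]
```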

There are only 97 observations with square-feet values. We cannot expect a highly accurate model trained on so few values, but as a proof of concept for this project we will go ahead and build one, since this column plays an important role in calculating property costs. We are told to assume that property cost and square footage are homogeneous within a given neighbourhood.

Hence, we will build a model to predict the square_feet of a property.
To determine which features correlate best with square_feet, let's do some analysis.

In [22]:
#Calculating median rent for each neighbourhood
median_rent = ab.groupby(['neighbourhood_group_cleansed'])[['price']].agg('median')
median_rent.columns = ['median_rent']
In [23]:
fig = px.bar(median_rent, x=median_rent.index , y  = 'median_rent',\
    text = median_rent['median_rent'], title = 'Median rent across Neighborhoods',height = 600, orientation = 'v', color = median_rent.index  )
fig

We see that Manhattan is the most expensive neighborhood in which to rent a property, followed by Brooklyn and Queens.

Now, let's check trend for Rent and cleaning fee with Square feet.

In [24]:
fig = px.scatter(ab, x= "square_feet", y='price', color= "neighbourhood_group_cleansed", trendline = 'ols',\
           labels = {"square_feet": "Square feet", "price": "Rent"}, title = 'Square feet vs Rent',\
   height = 750 )

fig.update_traces(marker=dict(size=12),
                  selector=dict(mode='markers'))
fig

From the scatter plot above, we can infer that for all neighborhood groups, rent increases with the size of the property. The trendline for the Queens neighborhood does not show a strong linear relationship because it has fewer data points.

In [25]:
fig = px.scatter(ab, x= "square_feet", y='cleaning_fee', color= "neighbourhood_group_cleansed", trendline = 'ols',\
           labels = {"square_feet": "Square feet", "cleaning_fee": "Cleaning fee"}, title = 'Square feet vs Cleaning fee',\
   height = 750 )

fig.update_traces(marker=dict(size=12),
                  selector=dict(mode='markers'))
fig

The plot above shows how the cleaning fee trends with property size (square feet). From the 97 data points we currently have for square feet, there does not appear to be a strong relationship between the two.

In [26]:
fig = px.histogram(ab, x="square_feet",  color="room_type", 
                   title = 'Distribution of properties by room_type and square_feet')
fig
  • Most entire homes/apartments have a size between 700 and 1000 sq. feet.
  • We cannot conclude anything about private rooms: their sizes vary widely and there are only 7 of them.

Now, let's plot a correlation heatmap to identify the features that correlate with square_feet.

In [27]:
#Get all continuous variables
for_heatmap = ab[['square_feet', 'price', 'cleaning_fee',
       'extra_people', 'number_of_reviews']]
In [28]:
# Increase the size of the heatmap.
# Store heatmap object in a variable to easily access it when you want to include more features (such as title).
# Setting the range of values to be displayed on the colormap from -1 to 1, and setting the annotation to True to display the correlation values on the heatmap.
heatmap = sns.heatmap(for_heatmap.corr(), vmin=-1, vmax=1, annot=True)
# Giving a title to the heatmap. Pad defines the distance of the title from the top of the heatmap.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

We see that only the price feature shows some correlation with square_feet.

In [29]:
#Separating out the features that seem useful for predicting square_feet
ab_for_ml = ab[['zipcode', 'accommodates', 'bathrooms',
       'bedrooms', 'beds',  'price', 'cleaning_fee','neighbourhood_group_cleansed','room_type', 'square_feet']]
In [30]:
#Filtering out null values from the square_feet column so the curated dataset can be used to train the ML model
ab_known = ab_for_ml[ab_for_ml.square_feet.isna()==False]
In [31]:
#X will be the predictor columns
X = ab_known[['zipcode','neighbourhood_group_cleansed','room_type','accommodates', 'bathrooms',
       'bedrooms', 'beds',  'price', 'cleaning_fee']]
#We will need to process categorical and numerical features separately before passing it to ML model. Hence creating
#a separate list of categorical and numerical features
#col_cat is a list of Categorical features
col_cat = ['zipcode','neighbourhood_group_cleansed','room_type']
#col_num is a list of Numerical features
col_num = ['accommodates', 'bathrooms', 'bedrooms', 'beds',  'price', 'cleaning_fee']
#X_num is the dataframe with numerical features
X_num = ab_known[col_num]
#X_cat is the dataframe with categorical features
X_cat = ab_known[col_cat]
#y is the target label 
y = ab_known[['square_feet']]
In [32]:
#Splitting the dataset (rows with known square_feet values) into training (80%) and testing (20%) sets.


X__train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

Lasso regression with hyper-parameter tuning using GridSearchCV

In [33]:
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

def MSE(y_true,y_pred):
    mse = mean_squared_error(y_true, y_pred)
    print(f'MSE: {mse}')
    return mse

def R2(y_true,y_pred):    
    r2 = r2_score(y_true, y_pred)
    print(f'R2: {r2}')
    return r2

def two_score(y_true,y_pred):    
    MSE(y_true,y_pred) #set score here and not below if using MSE in GridCV
    score = R2(y_true,y_pred)
    return score

def two_scorer():
    return make_scorer(two_score, greater_is_better=True) 

Training the model

In [34]:
from sklearn.linear_model import Lasso
pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(StandardScaler(), SimpleImputer())
preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))
pipe = Pipeline(steps = [('preprocessor',preprocessor),
                 ('lasso', Lasso(max_iter=10000))
                ])
param_grid = {
    'lasso__alpha':[0.03, 0.3, 3]}
lasso_grid = GridSearchCV(estimator = pipe, param_grid=param_grid, scoring = two_scorer(), cv=10, n_jobs=-1, return_train_score=True)

lasso_grid.fit(X__train,y_train)
Out[34]:
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('pipeline-1',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['zipcode',
                                                                          'neighbourhood_group_cleansed',
                                                                          'room_type']),
                                                                        ('pipeline-2',
                                                                         Pipeline(steps=[('standardscaler',
                                                                                          StandardScaler()),
                                                                                         ('simpleimputer',
                                                                                          SimpleImputer())]),
                                                                         ['accommodates',
                                                                          'bathrooms',
                                                                          'bedrooms',
                                                                          'beds',
                                                                          'price',
                                                                          'cleaning_fee'])])),
                                       ('lasso', Lasso(max_iter=10000))]),
             n_jobs=-1, param_grid={'lasso__alpha': [0.03, 0.3, 3]},
             return_train_score=True, scoring=make_scorer(two_score))

Obtain best parameters

In [35]:
print(lasso_grid.best_estimator_)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['zipcode',
                                                   'neighbourhood_group_cleansed',
                                                   'room_type']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler()),
                                                                  ('simpleimputer',
                                                                   SimpleImputer())]),
                                                  ['accommodates', 'bathrooms',
                                                   'bedrooms', 'beds', 'price',
                                                   'cleaning_fee'])])),
                ('lasso', Lasso(alpha=3, max_iter=10000))])

Score on training data

In [36]:
mean_squared_error(y_train, lasso_grid.predict(X__train))

lasso_grid.score(X__train, y_train)
MSE: 113526.25804431346
R2: 0.5131370030822018
Out[36]:
0.5131370030822018

Score on testing data

In [37]:
print("Training set score: {:.2f}".format(lasso_grid.score(X__train, y_train)))
print("Test set score: {:.2f}".format(lasso_grid.score(X_test, y_test)))
# print("Number of features used:", np.sum(lasso_grid.coef_ != 0))
MSE: 113526.25804431346
R2: 0.5131370030822018
Training set score: 0.51
MSE: 113945.26791054332
R2: -0.7678883819626521
Test set score: -0.77


In [38]:
# Creating the dataset to predict missing square_feet
sqfeet_predict = ab[ab.square_feet.isna()==True]
sqfeet_predict = sqfeet_predict[['zipcode','neighbourhood_group_cleansed','room_type','accommodates', 'bathrooms',
       'bedrooms', 'beds',  'price', 'cleaning_fee', 'square_feet']]
X_pred = sqfeet_predict[['zipcode','neighbourhood_group_cleansed','room_type','accommodates', 'bathrooms',
       'bedrooms', 'beds',  'price', 'cleaning_fee']]
col_cat = ['zipcode','neighbourhood_group_cleansed','room_type']
col_num = ['accommodates', 'bathrooms', 'bedrooms', 'beds',  'price', 'cleaning_fee']
X_num_pred = sqfeet_predict[col_num]
X_cat_pred = sqfeet_predict[col_cat]
y_pred = sqfeet_predict[['square_feet']]

Make predictions

In [39]:
#Predicting missing square_feet values by passing predictor columns to the lasso model
y_predicted = lasso_grid.predict(X_pred)
In [40]:
#Assigning the predicted values to the dataset
sqfeet_predict['square_feet'] = y_predicted
In [41]:
#Merging the dataframe containing predicted values for square_feet with the original dataframe containing missing values
ab_after_prediction = pd.merge(ab, sqfeet_predict, how = 'inner', left_on = ab.index ,right_on = sqfeet_predict.index)
In [42]:
ab_after_prediction.head()
Out[42]:
key_0 last_scraped neighbourhood_group_cleansed_x city state zipcode_x is_location_exact market latitude longitude property_type room_type_x accommodates_x bathrooms_x bedrooms_x beds_x bed_type square_feet_x price_x cleaning_fee_x extra_people availability_30 availability_60 availability_90 availability_365 number_of_reviews zipcode_y neighbourhood_group_cleansed_y room_type_y accommodates_y bathrooms_y bedrooms_y beds_y price_y cleaning_fee_y square_feet_y
0 6 2020-07-08 Brooklyn Brooklyn NY 11232 t New York 40.66120 -73.99423 Apartment Entire home/apt 4 1.5 2.0 2.0 Real Bed NaN 253.0 80.0 25.0 23 53 83 358 1 11232 Brooklyn Entire home/apt 4 1.5 2.0 2.0 253.0 80.0 1006.610370
1 38 2020-07-08 Brooklyn Brooklyn NY 11206 t New York 40.70420 -73.93560 Loft Entire home/apt 2 1.0 2.0 0.0 Real Bed NaN 228.0 128.0 10.0 13 43 73 159 95 11206 Brooklyn Entire home/apt 2 1.0 2.0 0.0 228.0 128.0 726.093036
2 46 2020-07-08 Manhattan New York City NY 10001 t New York 40.74623 -73.99530 Apartment Entire home/apt 4 2.0 2.0 2.0 Real Bed NaN 400.0 125.0 0.0 0 0 28 118 5 10001 Manhattan Entire home/apt 4 2.0 2.0 2.0 400.0 125.0 1490.919830
3 47 2020-07-08 Manhattan New York City NY 10162 f New York 40.77065 -73.95269 Apartment Entire home/apt 6 1.0 2.0 4.0 Real Bed NaN 200.0 200.0 20.0 30 60 75 306 82 10162 Manhattan Entire home/apt 6 1.0 2.0 4.0 200.0 200.0 747.392100
4 53 2020-07-07 Brooklyn Brooklyn NY 11215 t New York 40.67343 -73.98338 Condominium Entire home/apt 4 2.0 2.0 1.0 Real Bed NaN 225.0 100.0 0.0 0 0 0 0 4 11215 Brooklyn Entire home/apt 4 2.0 2.0 1.0 225.0 100.0 1176.196190
In [43]:
ab_after_prediction.drop(columns = ['zipcode_y', 'neighbourhood_group_cleansed_y', 'room_type_y',
       'accommodates_y', 'bathrooms_y', 'bedrooms_y', 'beds_y', 'price_y',
       'cleaning_fee_y'], inplace = True)
In [44]:
#Replacing null values in the square_feet column with values from the predicted square_feet column
ab_after_prediction['square_feet_x'].fillna(ab_after_prediction['square_feet_y'], inplace = True)
In [45]:
ab_after_prediction['square_feet_x'].describe()
Out[45]:
count    6152.000000
mean      807.688178
std       294.218033
min      -612.488796
25%       738.042779
50%       799.079939
75%       878.620604
max      2136.569236
Name: square_feet_x, dtype: float64
In [46]:
#square_feet cannot be negative. Counting the rows with negative predictions (dropped in the next cell)
(ab_after_prediction[ab_after_prediction['square_feet_x']<0]).shape[0]
Out[46]:
129
In [47]:
#Filtered dataset with only positive values for square_feet
ab_after_prediction = ab_after_prediction[ab_after_prediction['square_feet_x']>0]
In [48]:
#Statistics for the square_feet column to check whether we have any outliers in the dataset
ab_after_prediction['square_feet_x'].describe(percentiles = [0.05,0.1,0.25,0.50,0.75,0.85,0.95,0.99])
Out[48]:
count    6023.000000
mean      827.636650
std       262.987077
min         5.222012
5%        319.858395
10%       475.188743
25%       745.778065
50%       800.590670
75%       883.812734
85%      1132.205961
95%      1249.857911
99%      1561.108124
max      2136.569236
Name: square_feet_x, dtype: float64
In [49]:
fig = px.scatter(ab_after_prediction, x= "square_feet_x", y='price_x', color= "neighbourhood_group_cleansed_x", trendline = 'ols',\
           labels = {"square_feet_x": "Square feet", "price_x": "Rent"}, title = 'Square feet vs Rent',\
   height = 750 )

fig.update_traces(marker=dict(size=5),
                  selector=dict(mode='markers')
                 )
fig

The above graph shows a linear relationship between square_feet (including predicted values) and rent for all the neighborhoods except Staten Island.

In [50]:
ab_after_prediction.drop(columns = ['key_0','is_location_exact', 'square_feet_y'], inplace = True)
#Renaming columns
ab_after_prediction.columns = ['last_scraped', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode',
        'market', 'latitude', 'longitude', 'property_type',
       'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
       'bed_type', 'square_feet', 'price', 'cleaning_fee',
       'extra_people', 'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews']
ab_after_prediction.dropna(inplace = True)
In [51]:
# ab_after_prediction.to_csv('ab_after_prediction.csv')

Zillow data

In [52]:
# ab_after_prediction = pd.read_csv('ab_after_prediction.csv')
ab_after_prediction['zipcode'] = ab_after_prediction['zipcode'].astype('str')
In [53]:
zillow = pd.read_excel('(DC 4) Zip_Zhvi_2bedroom (1).xlsx')
In [54]:
#We see that some of the columns have their names as datetime objects. We need to convert them to strings.
#Let's first extract those datetime columns
date_cols = [col for index, col in enumerate(list(zillow.columns)) if index > 6]
#Converting them to string "yyyy-mm" format
date_cols_converted = [col.strftime('%Y-%m') for col in date_cols]
#Creating a new list of modified column names
zillow_cols = list(zillow.columns)[:7] + date_cols_converted
#Assigning new column names to original data
zillow.columns = zillow_cols
In [55]:
#Keep only those rows where City is New york and State is New York
zillow = zillow[(zillow['City']=='New York') & (zillow['State'] == 'NY')]
In [56]:
#Heatmap to check missing patterns in the data
cmap = sns.cubehelix_palette(light=1, as_cmap=True, reverse=True)
sns.heatmap(zillow.isnull(), cmap = cmap)
Out[56]:
<AxesSubplot:>

For the monthly median cost of properties, no values are missing from 2007 onwards. Hence we will drop the data from 1996 to 2006.

In [57]:
#unpivoting the dataframe (Converting from wide to long format)
zillow = zillow.melt(id_vars=['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
       'SizeRank'], var_name='date', value_name='price')
In [58]:
#splitting the date column into Year and Month
month_and_year =  zillow['date'].str.split('-',expand = True)
#create separate column for year
zillow['year'] = month_and_year[0]
#create separate column for month
zillow['month'] = month_and_year[1]
In [59]:
zillow.isna().sum()
Out[59]:
RegionID         0
RegionName       0
City             0
State            0
Metro            0
CountyName       0
SizeRank         0
date             0
price         1696
year             0
month            0
dtype: int64
In [60]:
#Drop all rows from 1996 through 2006 (keep 2007 onwards)
zillow = zillow[zillow.year > '2006']
In [61]:
#Yearly median price by Zipcode
zillow = zillow.groupby(['RegionName','CountyName','year'])[['price']].agg('median')
zillow = zillow.reset_index()
zillow.head()
Out[61]:
RegionName CountyName year price
0 10003 New York 2007 1414900.0
1 10003 New York 2008 1532950.0
2 10003 New York 2009 1396050.0
3 10003 New York 2010 1278700.0
4 10003 New York 2011 1363050.0
In [62]:
zillow = zillow.pivot(index = ['RegionName', 'CountyName'] , columns = ['year'], values = 'price')
zillow = zillow.reset_index()
zillow.head()
Out[62]:
year RegionName CountyName 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 10003 New York 1414900.0 1532950.0 1396050.0 1278700.0 1363050.0 1370400.0 1533100.0 1735500.0 1809000.0 1935350.0 2005500.0
1 10011 New York 1528250.0 1676250.0 1508800.0 1446300.0 1435900.0 1542250.0 1798350.0 2029500.0 2066800.0 2268000.0 2354000.0
2 10013 New York 2105650.0 2309800.0 2250650.0 2278500.0 2323650.0 2634900.0 2739600.0 3063700.0 3083850.0 3300500.0 3212450.0
3 10014 New York 1556400.0 1778950.0 1555600.0 1587450.0 1650450.0 1751050.0 1983800.0 2087200.0 2235200.0 2409150.0 2476250.0
4 10021 New York 1243550.0 1408750.0 1184350.0 1203000.0 1252550.0 1278800.0 1313850.0 1456550.0 1471850.0 1624800.0 1709950.0

Step 7.1 CAGR

  • Since the Airbnb data is for the year 2020 and the latest median monthly property costs in the Zillow data are for 2017, we need to estimate the median cost of properties for the year 2020.
  • Hence we will use CAGR (Compound Annual Growth Rate) to estimate the 2020 median property price for each zipcode.
  • We will compute the 3-year CAGR between 2014 and 2017 and apply that rate to the 2017 median prices to forecast the 2020 median property prices.
  • The formula and a sample calculation are shown below
In [63]:
%%render
medianPrice_2012 = 1000000.0 #Consider median price for properties in a zipcode for the year 2012 is 1M USD
medianPrice_2017 = 2000000.0 #Consider median price for properties in a zipcode for the year 2017 is 2M USD
n = 5 

CAGR = (((medianPrice_2017/medianPrice_2012)**(1/n))-1)*100 
\[ \begin{aligned} \mathrm{medianPrice}_{2012} &= 1000000.0 \; \;\textrm{(Consider median price for properties in a zipcode for the year 2012 is 1M USD)} \\[10pt] \mathrm{medianPrice}_{2017} &= 2000000.0 \; \;\textrm{(Consider median price for properties in a zipcode for the year 2017 is 2M USD)} \\[10pt] n &= 5 \; \\[10pt] \mathrm{CAGR} &= \left( \left( \frac{ \mathrm{medianPrice}_{2017} }{ \mathrm{medianPrice}_{2012} } \right) ^{ \left( \frac{ 1 }{ n } \right) } - 1 \right) \cdot 100 \\&= \left( \left( \frac{ 2000000.0 }{ 1000000.0 } \right) ^{ \left( \frac{ 1 }{ 5 } \right) } - 1 \right) \cdot 100 \\&= 14.87 \\ \end{aligned} \]
In [64]:
#Calculating the 10, 5, 2 and 3 year CAGR for each zipcode
zillow['cagr_10'] = rate(zillow, '2007', '2017')
zillow['cagr_5'] = rate(zillow, '2012', '2017')
zillow['cagr_2'] = rate(zillow, '2015', '2017')
zillow['cagr_3'] = rate(zillow, '2014', '2017')
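
The `rate` helper is defined earlier in the notebook. A minimal sketch consistent with its use here and with the CAGR values in the table below (assuming it accepts the two year columns in either order and returns the percentage CAGR rounded to 3 decimals):

```python
import pandas as pd

def rate(df: pd.DataFrame, year_a: str, year_b: str) -> pd.Series:
    # CAGR (%) between two year columns; the years may be passed in either order.
    start, end = sorted([year_a, year_b], key=int)
    n = int(end) - int(start)
    return round((((df[end] / df[start]) ** (1 / n)) - 1) * 100, 3)
```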
In [65]:
zillow['RegionName'] = zillow['RegionName'].astype('str') 
In [66]:
zillow.head()
Out[66]:
year RegionName CountyName 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 cagr_10 cagr_5 cagr_2 cagr_3
0 10003 New York 1414900.0 1532950.0 1396050.0 1278700.0 1363050.0 1370400.0 1533100.0 1735500.0 1809000.0 1935350.0 2005500.0 3.550 7.913 5.291 4.938
1 10011 New York 1528250.0 1676250.0 1508800.0 1446300.0 1435900.0 1542250.0 1798350.0 2029500.0 2066800.0 2268000.0 2354000.0 4.415 8.825 6.722 5.068
2 10013 New York 2105650.0 2309800.0 2250650.0 2278500.0 2323650.0 2634900.0 2739600.0 3063700.0 3083850.0 3300500.0 3212450.0 4.315 4.043 2.064 1.593
3 10014 New York 1556400.0 1778950.0 1555600.0 1587450.0 1650450.0 1751050.0 1983800.0 2087200.0 2235200.0 2409150.0 2476250.0 4.753 7.176 5.254 5.863
4 10021 New York 1243550.0 1408750.0 1184350.0 1203000.0 1252550.0 1278800.0 1313850.0 1456550.0 1471850.0 1624800.0 1709950.0 3.236 5.983 7.785 5.492
In [67]:
#Merging Airbnb and Zillow data
df_merged = pd.merge(ab_after_prediction, zillow, how = 'inner', left_on = 'zipcode', right_on = 'RegionName')
In [68]:
# df_merged.drop('key_0', axis =1 ,inplace = True)
In [69]:
%%render
MedianCost_2017 = 1000000
CAGR_3year = 5.0
MedianCost_2020 = (MedianCost_2017)*((1+(CAGR_3year/100))**2)
\[ \begin{aligned} \mathrm{MedianCost}_{2017} &= 1000000 \; \\[10pt] \mathrm{CAGR}_{3year} &= 5.0 \; \\[10pt] \mathrm{MedianCost}_{2020} &= \mathrm{MedianCost}_{2017} \cdot \left( 1 + \left( \frac{ \mathrm{CAGR}_{3year} }{ 100 } \right) \right) ^{ 2 } = 1000000 \cdot \left( 1 + \left( \frac{ 5.0 }{ 100 } \right) \right) ^{ 2 } &= 1102500.0 \end{aligned} \]
In [70]:
#Calculate 2020 median cost of properties by forecasting
df_merged['2020'] =  round((df_merged['2017']*((1+(df_merged.cagr_3/100))**2)),1)
In [71]:
# df_merged['per_sq_feet'] = df_merged['2020']/df_merged['square_feet_x']

Calculating Median Square Feet as per Zipcode

In [72]:
zipcode_median_square_feet = df_merged.groupby(['zipcode'])[['square_feet']].agg('median')
zipcode_median_square_feet = zipcode_median_square_feet.reset_index()
zipcode_median_square_feet.columns = ['zipcode', 'median_sq_feet']
df_merged =  pd.merge(df_merged, zipcode_median_square_feet, how = 'left', left_on = 'zipcode',\
                        right_on = 'zipcode')
In [73]:
%%render

medianPrice_2020_forZipcodexxxxx = 1000000.0 #Median price for properties in a particular zipcode for the year 2020
medianSquareFeet_forZipcodexxxxx = 1200.0 #Median Square feet for all properties in a particular Zipcode. Obtained from Airbnb dataset


CostPerSquareFoot_forZipcodexxxxx = (medianPrice_2020_forZipcodexxxxx)/(medianSquareFeet_forZipcodexxxxx) #This will be the cost per square foot for a particular Zip Code
\[ \begin{aligned} \mathrm{medianPrice}_{2020_{forZipcodexxxxx}} &= 1000000.0 \; \;\textrm{(Median price for properties in a particular zipcode for the year 2020)} \\[10pt] \mathrm{medianSquareFeet}_{forZipcodexxxxx} &= 1200.0 \; \;\textrm{(Median Square feet for all properties in a particular Zipcode. Obtained from Airbnb dataset)} \\[10pt] \mathrm{CostPerSquareFoot}_{forZipcodexxxxx} &= \frac{ \mathrm{medianPrice}_{2020_{forZipcodexxxxx}} }{ \mathrm{medianSquareFeet}_{forZipcodexxxxx} } = \frac{ 1000000.0 }{ 1200.0 } &= 833.333 \; \;\textrm{(This will be the cost per square foot for a particular Zip Code)} \end{aligned} \]
In [74]:
# df_merged['cost_per_sq_foot'] =  round((df_merged['2020']/df_merged['median_sq_feet']),1)
df_merged = cost_per_foot(df_merged)
In [75]:
%%render
CostPerSquareFoot_forZipcodexxxxx = 833.33
SquareFootofProperty_inZipcodexxxxx = 1000
CostOfProperty_2020 = CostPerSquareFoot_forZipcodexxxxx * SquareFootofProperty_inZipcodexxxxx
\[ \begin{aligned} \mathrm{CostPerSquareFoot}_{forZipcodexxxxx} &= 833.33 \; \\[10pt] \mathrm{SquareFootofProperty}_{inZipcodexxxxx} &= 1000 \; \\[10pt] \mathrm{CostOfProperty}_{2020} &= \mathrm{CostPerSquareFoot}_{forZipcodexxxxx} \cdot \mathrm{SquareFootofProperty}_{inZipcodexxxxx} \\&= 833.33 \cdot 1000 \\&= 833330.0 \\ \end{aligned} \]
In [76]:
# df_merged['cost_2020'] = round((df_merged['cost_per_sq_foot']*df_merged['square_feet']), 1)
df_merged = cost_each_property(df_merged)
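
`cost_per_foot` and `cost_each_property` are defined earlier in the notebook; the commented-out lines above show the formulas they apply. A minimal sketch under that assumption:

```python
import pandas as pd

def cost_per_foot(df: pd.DataFrame) -> pd.DataFrame:
    # Cost per square foot for a zipcode: forecast 2020 median price
    # divided by the zipcode's median square footage.
    df['cost_per_sq_foot'] = round(df['2020'] / df['median_sq_feet'], 1)
    return df

def cost_each_property(df: pd.DataFrame) -> pd.DataFrame:
    # Estimated 2020 purchase cost of each listing, scaled by its square footage
    # (homogeneity assumption: cost is proportional to square feet within a locale).
    df['cost_2020'] = round(df['cost_per_sq_foot'] * df['square_feet'], 1)
    return df
```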
ROI Equity

  • ROI Equity is a measure of return based on the increase in the value of the property alone, assuming property prices appreciate every year.
  • In our case, it is an important metric from the investor's point of view: it shows how appreciation in the price of the property alone yields profit.
  • We will calculate ROI Equity for 10 years without including income from rent.
  • The 10-year CAGR has already been calculated using the same formula above for the period 2007-2017.
In [77]:
%%render
CostOfProperty_2020 = 1000000
CAGR_10yrs = 3.550
RoiEquity_10yrs = ((CostOfProperty_2020*(1+(CAGR_10yrs/100))**10) -  CostOfProperty_2020)/CostOfProperty_2020
\[ \begin{aligned} \mathrm{CostOfProperty}_{2020} &= 1000000 \; \\[10pt] \mathrm{CAGR}_{10yrs} &= 3.55 \; \\[10pt] \mathrm{RoiEquity}_{10yrs} &= \frac{ \left( \mathrm{CostOfProperty}_{2020} \cdot \left( 1 + \left( \frac{ \mathrm{CAGR}_{10yrs} }{ 100 } \right) \right) ^{ 10 } \right) - \mathrm{CostOfProperty}_{2020} }{ \mathrm{CostOfProperty}_{2020} } \\&= \frac{ \left( 1000000 \cdot \left( 1 + \left( \frac{ 3.55 }{ 100 } \right) \right) ^{ 10 } \right) - 1000000 }{ 1000000 } \\&= 4.174 \times 10 ^ {-1 } \\ \end{aligned} \]
In [78]:
# df_merged['roi_equity'] = ((df_merged['cost_2020'] * ((1 +(df_merged['cagr_10']/100))**10)) - df_merged['cost_2020'])/df_merged['cost_2020']
#Function to calculate roi_equity
df_merged = roi_equity(df_merged)
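
`roi_equity` is defined earlier in the notebook; the commented-out line above shows its formula. A minimal sketch under that assumption (the expression simplifies to (1 + CAGR/100)^10 − 1):

```python
import pandas as pd

def roi_equity(df: pd.DataFrame) -> pd.DataFrame:
    # Return from price appreciation alone over 10 years, using the 10-year CAGR.
    appreciated = df['cost_2020'] * (1 + df['cagr_10'] / 100) ** 10
    df['roi_equity'] = (appreciated - df['cost_2020']) / df['cost_2020']
    return df
```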
In [79]:
%%render
rentPerNight = 200
AnnualRent_2020 = rentPerNight*0.75*365 
\[ \begin{aligned} \mathrm{rentPerNight} &= 200 \; \\[10pt] \mathrm{AnnualRent}_{2020} &= \mathrm{rentPerNight} \cdot 0.75 \cdot 365 = 200 \cdot 0.75 \cdot 365 &= 54750.0 \end{aligned} \]
In [80]:
df_merged['Annual_rent'] = df_merged['price']*0.75*365
  • #### In order to calculate the ROI over 10 years, we will consider the appreciation in the cost of the property using the CAGR calculated for 10 years (2007-2017).
  • #### We will assume the rent stays constant and an occupancy rate of 75%.
In [81]:
%%render
rentPerNight = 200
AnnualRent_2020 = rentPerNight*0.75*365 

CostOfProperty_2020 = 1000000
CAGR_10yrs = 3.550

AnnualRent_10yrs = AnnualRent_2020*10
Appreciation_propertyCost = (CostOfProperty_2020*(1+(CAGR_10yrs/100))**10 -  CostOfProperty_2020)
ROI_10yrs = (AnnualRent_2020*10+Appreciation_propertyCost)/CostOfProperty_2020
\[ \begin{aligned} \mathrm{rentPerNight} &= 200 \; \\[10pt] \mathrm{AnnualRent}_{2020} &= \mathrm{rentPerNight} \cdot 0.75 \cdot 365 = 200 \cdot 0.75 \cdot 365 &= 54750.0 \\[10pt] \mathrm{CostOfProperty}_{2020} &= 1000000 \; \\[10pt] \mathrm{CAGR}_{10yrs} &= 3.55 \; \\[10pt] \mathrm{AnnualRent}_{10yrs} &= \mathrm{AnnualRent}_{2020} \cdot 10 = 54750.0 \cdot 10 &= 547500.0 \\[10pt] \mathrm{Appreciation}_{propertyCost} &= \mathrm{CostOfProperty}_{2020} \cdot \left( 1 + \frac{ \mathrm{CAGR}_{10yrs} }{ 100 } \right) ^{ 10 } - \mathrm{CostOfProperty}_{2020} \\&= 1000000 \cdot \left( 1 + \frac{ 3.55 }{ 100 } \right) ^{ 10 } - 1000000 \\&= 4.174 \times 10 ^ {5 } \\ \\[10pt] \mathrm{ROI}_{10yrs} &= \frac{ \mathrm{AnnualRent}_{2020} \cdot 10 + \mathrm{Appreciation}_{propertyCost} }{ \mathrm{CostOfProperty}_{2020} } = \frac{ 54750.0 \cdot 10 + 4.174 \times 10 ^ {5 } }{ 1000000 } &= 9.649 \times 10 ^ {-1 } \end{aligned} \]
In [82]:
df_merged = roi(df_merged, 10)
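
`roi` is defined earlier in the notebook; based on the worked example above, a minimal sketch (the signature `roi(df, years)` and the output column name `roi_10_yrs` are inferred from the cells that use it):

```python
import pandas as pd

def roi(df: pd.DataFrame, years: int) -> pd.DataFrame:
    # Total return over `years`: constant annual rent plus price appreciation
    # at the 10-year CAGR, relative to the 2020 purchase cost.
    appreciation = df['cost_2020'] * (1 + df['cagr_10'] / 100) ** years - df['cost_2020']
    df[f'roi_{years}_yrs'] = (df['Annual_rent'] * years + appreciation) / df['cost_2020']
    return df
```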

Metrics created so far

  • ##### Annual rent: considering 75% occupancy
  • ##### roi_equity: considering only the appreciation in the price of the property
  • ##### roi for 10 years: considering constant rent and the appreciation in the price of the property over 10 years
  • #### We will assume the rent stays constant and an occupancy rate of 75%.
  • #### We will not include the cleaning fee in revenue as most of it will be consumed by cleaning agencies.
  • #### We will not include the security deposit in revenue as it will be returned to the customer.
In [83]:
%%render
rentPerNight_2020 = 200
CostOfProperty_2020 = 1000000

CAGR_5yrs = 7.913

CostOfProperty_2020 = (rentPerNight_2020*365*0.75)*(1+(1+CAGR_5yrs/100)**1+(1+CAGR_5yrs/100)**2+(1+CAGR_5yrs/100)**3 )#and so on. We simplify this to the below equation



t  = math.log10(1 + (CostOfProperty_2020 * (CAGR_5yrs/100)/(365*0.75*rentPerNight)))/math.log10(1+CAGR_5yrs/100) #t denotes break even time in years
\[ \begin{aligned} \mathrm{rentPerNight}_{2020} &= 200 \; \\[10pt] \mathrm{CostOfProperty}_{2020} &= 246392.613 \; \\[10pt] \mathrm{CAGR}_{5yrs} &= 7.913 \; \\[10pt] \mathrm{CostOfProperty}_{2020} &= \left( \mathrm{rentPerNight}_{2020} \cdot 365 \cdot 0.75 \right) \cdot \left( 1 + \left( 1 + \frac{ \mathrm{CAGR}_{5yrs} }{ 100 } \right) ^{ 1 } + \left( 1 + \frac{ \mathrm{CAGR}_{5yrs} }{ 100 } \right) ^{ 2 } + \left( 1 + \frac{ \mathrm{CAGR}_{5yrs} }{ 100 } \right) ^{ 3 } \right) \\&= \left( 200 \cdot 365 \cdot 0.75 \right) \cdot \left( 1 + \left( 1 + \frac{ 7.913 }{ 100 } \right) ^{ 1 } + \left( 1 + \frac{ 7.913 }{ 100 } \right) ^{ 2 } + \left( 1 + \frac{ 7.913 }{ 100 } \right) ^{ 3 } \right) \\&= 246392.613 \; \;\textrm{(and so on. We simplify this to the below equation)}\\ \\[10pt] t &= \frac{ \mathrm{math.log10} { \left( 1 + \left( \mathrm{CostOfProperty}_{2020} \cdot \frac{ \frac{ \mathrm{CAGR}_{5yrs} }{ 100 } }{ 365 \cdot 0.75 \cdot \mathrm{rentPerNight} } \right) \right) } }{ \mathrm{math.log10} { \left( 1 + \frac{ \mathrm{CAGR}_{5yrs} }{ 100 } \right) } } \\&= \frac{ math.log10 { \left( 1 + \left( 246392.613 \cdot \frac{ \frac{ 7.913 }{ 100 } }{ 365 \cdot 0.75 \cdot 200 } \right) \right) } }{ math.log10 { \left( 1 + \frac{ 7.913 }{ 100 } \right) } } \\&= 4.0 \; \;\textrm{(t denotes break even time in years)}\\ \end{aligned} \]
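
The simplification referenced above is a geometric series: with annual rental revenue R = 365 · 0.75 · rentPerNight and property prices growing at rate g (the 5-year CAGR), the break-even condition can be derived as:

```latex
\begin{aligned}
\mathrm{Cost} &= R\sum_{k=0}^{t-1}(1+g)^{k} \;=\; R\,\frac{(1+g)^{t}-1}{g} \\
(1+g)^{t} &= 1+\frac{\mathrm{Cost}\cdot g}{R} \\
t &= \frac{\log\!\left(1+\frac{\mathrm{Cost}\cdot g}{R}\right)}{\log\left(1+g\right)}
\end{aligned}
```

which is the expression implemented with `log10` in the next cell (the ratio of logs is base-independent).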
In [84]:
import numpy as np
df_merged['break_even_years'] = round((np.log10(1+(df_merged['cost_2020']*(df_merged['cagr_5']/100)/(365*0.75*df_merged['price'])))/np.log10(1 + df_merged['cagr_5']/100)),1)
In [85]:
#Calculating Median Break Even Years as per Zipcode
median_break_even = df_merged.groupby(['zipcode'])['break_even_years'].agg('median')
median_break_even = median_break_even.reset_index()
median_break_even.columns = ['zipcode', 'median_breakeven_yrs']
median_break_even.head()
Out[85]:
zipcode median_breakeven_yrs
0 10003 17.8
1 10011 16.8
2 10013 27.6
3 10014 19.3
4 10021 17.8
In [86]:
df_merged = pd.merge(df_merged, median_break_even, how = "left", left_on = 'zipcode', right_on = 'zipcode')
In [87]:
#Remove columns with duplicate names
df_merged = df_merged.loc[:,~df_merged.columns.duplicated()]
In [88]:
df_merged.to_csv('all_metrics.csv')
In [89]:
df_ratios = df_merged.groupby(['zipcode'])[['median_breakeven_yrs', 'roi_equity', 'roi_10_yrs', '2020']].agg('median')
df_ratios.head()
Out[89]:
median_breakeven_yrs roi_equity roi_10_yrs 2020
zipcode
10003 17.8 0.417428 0.7 2208453.4
10011 16.8 0.540384 0.8 2598647.6
10013 27.6 0.525695 0.7 3315613.9
10014 19.3 0.590980 0.8 2775127.1
10021 17.8 0.375028 0.7 1902928.5
In [90]:
df_ratios.to_csv('grouped_metrics.csv')

Purpose:

Users will want to know which type of room (Private room or Entire home/apartment) yields more revenue in each of the neighborhoods. We will first analyze neighborhoods and then deep dive into zipcodes in the rest of the plots.

Insights:

Renting out an entire home/apartment will yield more revenue in all 4 neighborhoods. However, the Staten Island and Queens neighborhoods have only 26 and 16 datapoints respectively. Also, there are many zipcodes in each of these neighborhoods, so we will have to do a zipcode-level comparison in order to be sure.

Business value:

Business users can check how rental prices vary across neighborhoods.

In [91]:
data = pd.read_csv('all_metrics.csv')
data__1 = data.copy()
In [92]:
fig1 = px.box(data__1, x='room_type', y='price',facet_col="neighbourhood_group_cleansed",width=1600)
fig1.show()
In [93]:
neighborhoods_2 = list(data__1['neighbourhood_group_cleansed'].unique())

zipcodes_2=list(data__1[data__1['neighbourhood_group_cleansed']=='Manhattan']["zipcode"].unique())

origin_3 = widgets.Dropdown(
options = neighborhoods_2,
value='Manhattan',
description='Neighborhood :',
)

origin_4 = widgets.Dropdown(
options = zipcodes_2,
value=zipcodes_2[0],
description='zipcode :',
)


filter_list2 = [i and j  for i,j in
                           zip(data__1['neighbourhood_group_cleansed'] == 'Manhattan',data__1['zipcode'] == 10025)]

trace__1 = go.Box(x= data__1[filter_list2]['room_type'], y=data__1[filter_list2]['price'])
g4 = go.FigureWidget(data=[trace__1],
                layout=go.Layout(width=600, height=650, font_size=11,
                    barmode='group',
                    hovermode="closest",
                    title=dict(
                        text='Rent by neighborhood and roomtype'
                    )
                ))
g4.layout.xaxis.title = 'Neighborhood'
g4.layout.yaxis.title = 'Price'
In [94]:
def response2(change):
    if origin_3.value:
        filter_list2 = [i  for i in
                           zip( data__1['neighbourhood_group_cleansed'] == origin_3.value)]
        filter_list2 = [i[0] for i in filter_list2]
        temp_df = data__1[filter_list2]

        origin_4.options=list(temp_df["zipcode"].unique())
        if origin_4.value:
            temp_df=temp_df[temp_df["zipcode"]==origin_4.value]
        else :
            origin_4.value=origin_4.options[0]
            temp_df=temp_df[temp_df["zipcode"]==origin_4.value]
            
        with g4.batch_update():
            g4.data[0].x = temp_df['room_type']
            g4.data[0].y = temp_df["price"]
            g4.layout.xaxis.title = 'room_type'
            g4.layout.yaxis.title = 'Price'

origin_3.observe(response2, names="value")
origin_4.observe(response2, names="value")

container__2 = widgets.HBox([origin_3,origin_4])
widgets.VBox([container__2,g4])
In [95]:
d= pd.read_csv('all_metrics.csv')
d['zipcode'] = d['zipcode'].astype('str') 
In [96]:
px.scatter(d, x='zipcode', y ='2020', color='neighbourhood_group_cleansed', labels = {'2020': 'Median cost of Property 2020'}, \
           title = 'Median Cost of Property 2020 by Neighborhood and Zipcode',\
          )
In [97]:
px.scatter(d, x='zipcode', y ='median_breakeven_yrs', color='neighbourhood_group_cleansed',\
           labels = {'median_breakeven_yrs': 'Median break even period (years)'}
           , title = 'Median break even years by neighborhood and zipcode')
In [98]:
#Calculating median rent for each neighbourhood
median_rent = ab.groupby(['neighbourhood_group_cleansed'])[['price']].agg('median')
median_rent.columns = ['median_rent']
fig = px.bar(median_rent, x=median_rent.index , y  = 'median_rent',\
    text = median_rent['median_rent'], title = 'Median rent across Neighborhoods',height = 800, orientation = 'v', color = median_rent.index  )
fig

Purpose:

Identify the profitable zipcodes based on the metrics calculated. We prefer zipcodes that have a short break-even period and high annual rent or ROI.

Insights:

Attached separately with each graph.

Business value:

Users will be able to identify profitable zipcodes based on any of the metrics calculated.

We will now analyze Zipcodes based on the following 4 metrics that we created:

  • ### Annual rent: the rent generated by a property in a year, assuming it is occupied for 75% of the year.

  • ### Break-even period (in years): an estimate of the time required for the cumulative rent generated by a property to equal its purchase cost.

  • ### ROI Equity: an estimate of ROI considering only the appreciation in the value of the property.

  • ### ROI: the most important metric; it takes into account both the annual rent and the appreciation in the value of the property.

In [99]:
dat=pd.read_csv("all_metrics.csv")
In [100]:
dat = dat[dat['roi_10_yrs']<10]
dat['zipcode'] = dat['zipcode'].astype('str')

Insights:

  • Zipcode 11217 has the lowest break-even period and moderate annual rent.
  • Zipcode 10003 has a very high rent with a moderate-to-high break-even period.
  • Zipcode 10036 has a moderate break-even period with high rent.
  • Zipcode 11201 has a very short break-even period and moderate rent.
  • Zipcode 10025 has a short break-even period with moderate rent.
  • Zipcode 11215 has a very short break-even period with moderate rent.

We ignore the extreme cases and select zipcodes that offer a sweet spot between break-even period and annual rent: 11217 (Brooklyn), 11201 (Brooklyn), 11215 (Brooklyn), 10036 (Manhattan), 10025 (Manhattan)

In [101]:
fig = px.scatter(dat,x="break_even_years",y="Annual_rent",color="zipcode",
                 color_discrete_sequence=px.colors.qualitative.Alphabet, hover_data = {'zipcode': True,
                                                                            'neighbourhood_group_cleansed': True},\
                title = 'Annual Rent vs Breakeven period')

fig

Insights:

  • ROI Equity takes into account the appreciation in the value of the property over 10 years.
  • By our calculations, zipcodes in the Brooklyn neighborhood have the highest ROI Equity over the 10-year period.
  • These zipcodes include: 11201, 11217, 11231, 11215.
  • For short-term investments, the above-mentioned zipcodes are a good option.
In [102]:
fig=px.scatter(dat,x="break_even_years",y="roi_equity",color="zipcode",color_discrete_sequence=px.colors.qualitative.Alphabet,\
               hover_name="zipcode",hover_data=["Annual_rent","neighbourhood_group_cleansed","break_even_years","roi_equity","cost_2020"])
fig.update_traces(textposition='top right',textfont_size=7)
fig.update_layout(height=600, width=1000)
In [103]:
fig=px.scatter(dat,x="break_even_years",y="roi_10_yrs",color="zipcode",\
               color_discrete_sequence=px.colors.qualitative.Alphabet,hover_name="zipcode",\
               hover_data=["Annual_rent","break_even_years","roi_10_yrs","cost_2020", 'neighbourhood_group_cleansed'])
fig.update_traces(textposition='top right',textfont_size=7)
fig.update_layout(height=500, width=900)

Insight:

ROI takes into account 2 metrics:

  • Annual Rent for 10 years
  • ROI Equity for 10 years

According to the graph, the top-left quadrant is dominated by zipcodes from the Brooklyn neighborhood (11201, 11217, 11231, 11215), which have high ROI, ranging from 1.3 to 2.7, and relatively short break-even periods.
Zipcodes 10036 and 10025 in Manhattan have short break-even periods with moderate ROI.
Zipcode 10305 in Staten Island has a short break-even period with high ROI.
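Combining the two components listed above, a plausible sketch of the 10-year ROI is the following (the exact formula and names are my assumption of how rent and equity could combine, not necessarily the notebook's implementation):

```python
def roi_10_yrs(cost_2020: float, annual_rent: float, cagr: float, years: int = 10) -> float:
    """Total 10-year return: cumulative rent plus equity appreciation,
    expressed as a multiple of the cash purchase price."""
    rent_component = annual_rent * years                      # annual rent for 10 years
    equity_component = cost_2020 * ((1 + cagr) ** years - 1)  # ROI equity for 10 years
    return (rent_component + equity_component) / cost_2020

# e.g. $1M property, $80k annual rent, 3% annual appreciation
print(round(roi_10_yrs(1_000_000, 80_000, 0.03), 3))
```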

In [104]:
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=("Annual_rent", "roi_equity", "roi_10_yrs"), )
fig.add_trace(go.Scatter(x=dat["break_even_years"], y=dat["Annual_rent"],\
                         mode="markers",text=list(dat["zipcode"]),name="Annual_rent"),
              row=1, col=1)

fig.add_trace(go.Scatter(x=dat["break_even_years"], y=dat["roi_10_yrs"],\
                         mode="markers",text=list(dat["zipcode"]),name="roi"),
              row=1, col=3)
fig.add_trace(go.Scatter(x=dat["break_even_years"], y=dat["roi_equity"],\
                     mode="markers",text=list(dat["zipcode"]),name="roi_equity"),
              row=1, col=2)

fig.update_layout(height=500, width=1200,
                  title_text="Side-by-side comparison of metrics vs. payback period")
fig.update_traces(textposition='top right',textfont_size=5)
fig.show()

Dashboard for users to analyze properties based on their criteria

In [105]:
dat=pd.read_csv("all_metrics.csv")
In [106]:
dat = dat[dat['roi_10_yrs']<10]
In [107]:
mapbox_access_token = "pk.eyJ1IjoieWFzaDk1MTYiLCJhIjoiY2tsYXJ5Zm5vMDVjMDJvbWR2cjFwcXd2MyJ9.cxAGwJ720HWnZC0kccSPDw"
year = widgets.IntRangeSlider(
    value=[2, 47],  # initial range must lie within [min, max]
    min=2,
    max=47,
    step=1,
    description='Break_even_period(years):',
    continuous_update=False
)


roi_equity = widgets.FloatRangeSlider(
    value=[-0.5,2.0],
    
    min=-0.5,
    max=2.0,
    step=0.25,
    description='roi_equity:',
    continuous_update=False
)


roi = widgets.FloatRangeSlider(
   value=[0,5],
    min=0.0,
    max=5.0,
    step=0.25,
    description='roi:',
    continuous_update=False
)

annual = widgets.IntRangeSlider(
   value=[0,dat["Annual_rent"].max()+1000],
    min=0,
    max=dat["Annual_rent"].max()+1000,
    step=250,
    description='annual_rent:',
    continuous_update=False
)

zip_list=dat['zipcode'].unique().tolist()
zip_list.append("All")


container = widgets.HBox(children=[year,roi_equity,roi,annual])


textbox = widgets.Dropdown(
    description='zipcodes:   ',
    value='All',
    options=zip_list
)

import plotly.express as px
px.set_mapbox_access_token(mapbox_access_token)



data = [
    go.Scattermapbox(
        lat=dat["latitude"],
        lon=dat["longitude"],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=5,
        ),text=list(dat["zipcode"]),
        
        hoverinfo="text"
    )
]

g2 = go.FigureWidget(data=data,
                    layout=
                         go.Layout(
    autosize=True,
    hovermode='closest',
    mapbox=go.layout.Mapbox(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.7406,
            lon=-73.9985
        ),
        pitch=0,
        zoom=9
    
),height=800,width=1000
                    ))
In [108]:
def response(change):
    # combine the four slider ranges into a single boolean mask
    mask = (dat['break_even_years'].between(year.value[0], year.value[1])
            & dat['roi_10_yrs'].between(roi.value[0], roi.value[1])
            & dat['roi_equity'].between(roi_equity.value[0], roi_equity.value[1])
            & dat['Annual_rent'].between(annual.value[0], annual.value[1]))
    temp_df = dat[mask]

    # update the mapbox trace in place with the filtered points
    with g2.batch_update():
        g2.data[0].lat = temp_df["latitude"]
        g2.data[0].lon = temp_df["longitude"]
        g2.data[0].text = list(temp_df["zipcode"])


roi_equity.observe(response, names="value")
roi.observe(response, names="value")
year.observe(response, names="value")
annual.observe(response,names="value")
In [109]:
widgets.VBox([container,g2])
#Note: The map might throw an error after loading, but the functionality and interactivity of the map are not affected by it.
# The error is due to a bug in the package.
#refer: https://github.com/plotly/plotly.py/issues/2570

Map to get an overview of 2-bedroom properties in NY

Purpose:

The map visualizes 2 metrics:

  • Break even period
  • ROI Equity

Business value:

Analyze 2 bedroom properties based on the metrics calculated before.

Map usage:

Click on the Grey stacked layer icon on the right side of the map to analyze properties as per any one of the metrics.

In [110]:
import folium
data1=pd.read_csv("all_metrics.csv")
import urllib.request, json 
with urllib.request.urlopen("https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson") as url:
    data_geo = json.loads(url.read().decode())
In [111]:
county=pd.read_csv('https://query.data.world/s/buotxun7n3q6twzd6havwil7h5nqpb')
county["ZIP"]=county["ZIP"].astype(str)
county.head()
Out[111]:
ZIP STCOUNTYFP CITY STATE COUNTYNAME CLASSFP
0 36091 1001 Verbena AL Autauga County H1
1 36758 1001 Plantersville AL Autauga County H1
2 36006 1001 Billingsley AL Autauga County H1
3 36067 1001 Prattville AL Autauga County H1
4 36701 1001 Selma AL Autauga County H1
In [112]:
data1=data1.groupby(["zipcode"],as_index=False).mean()[["longitude","latitude","zipcode","break_even_years","cost_2020",\
                                                        "Annual_rent","roi_equity","roi_10_yrs"]]
county=pd.read_csv('https://query.data.world/s/buotxun7n3q6twzd6havwil7h5nqpb')
county["ZIP"]=county["ZIP"].astype(str)
data1['zipcode'] = data1['zipcode'].astype(str)
data1=data1.merge(county,how="left",left_on="zipcode",right_on="ZIP")
tmp=data_geo
geozips = []
for i in range(len(tmp['features'])):
    if tmp['features'][i]['properties']['postalCode'] in list(data1['zipcode'].unique()):
        geozips.append(tmp['features'][i])

new_json = dict.fromkeys(['type','features'])
new_json['type'] = 'FeatureCollection'
new_json['features'] = geozips
with open("cleaned_geodata.json", "w") as f:
    json.dump(new_json, f, sort_keys=True, indent=4, separators=(',', ': '))
In [113]:
from folium import FeatureGroup, LayerControl, Map, Marker
king_geo= "cleaned_geodata.json"
from folium.plugins import MarkerCluster

map = folium.Map(location=[40.693943, -73.985880], zoom_start=15, tiles="cartodbpositron")

k=folium.Choropleth(geo_data=king_geo,
             data=data1, # my dataset
             columns=['zipcode', 'break_even_years'], # zipcode matches the geojson zipcode; break_even_years drives the color scale
             key_on='feature.properties.postalCode', # path to the (string) zipcode in the geojson, which must match our zipcode column
             fill_color='YlOrRd', fill_opacity=0.8, line_opacity=0.8,
             legend_name='break_even_years',name="break_even_years").add_to(map)

marker_cluster = MarkerCluster().add_to(k)
for i in range(data1.shape[0]):
    location = [data1['latitude'][i],data1['longitude'][i]]
    tooltip = "Zipcode:{}<br> COUNTYNAME: {}<br> Click for more".format(data1["zipcode"][i],data1["COUNTYNAME"][i])
    
    folium.Marker(location, # adding more details to the popup screen using HTML
                  popup="""
                  <i>Break even years: </i> <br> <b>{}</b> <br> 
                  <i>Annual Rent: </i><b><br>{}</b><br>
                  <i>Selling price: </i><b><br>{}</b><br>""".format(
                    round(data1['break_even_years'][i],2), 
                    round(data1['Annual_rent'][i],2), 
                    round(data1['cost_2020'][i],2)), 
                  tooltip=tooltip).add_to(marker_cluster)


    
n=folium.Choropleth(geo_data=king_geo,
             data=data1, # my dataset
             columns=['zipcode', 'roi_equity'], # zipcode matches the geojson zipcode; roi_equity drives the color scale for this layer
             key_on='feature.properties.postalCode', # path to the (string) zipcode in the geojson, which must match our zipcode column
             fill_color='PuBuGn', fill_opacity=0.8, line_opacity=0.8,
             legend_name='roi_equity',name="roi_equity").add_to(map)
marker_cluster2 = MarkerCluster().add_to(n)
for i in range(data1.shape[0]):
    location = [data1['latitude'][i],data1['longitude'][i]]
    tooltip = "Zipcode:{}<br> County: {}<br> Click for more".format(data1["zipcode"][i],data1["COUNTYNAME"][i])
    
    folium.Marker(location, # adding more details to the popup screen using HTML
                  popup="""
                  <i>County:</i> <br> <b>{}</b> <br> 
                  <i>ROI_10: </i><b><br>{}</b><br>
                  <i>Breakeven years: </i><b><br>{}</b><br>""".format(
                    data1['COUNTYNAME'][i], 
                    round(data1['roi_10_yrs'][i],2), 
                    round(data1['break_even_years'][i],2)), 
                  tooltip=tooltip).add_to(marker_cluster2)  

    

LayerControl().add_to(map)
map
Out[113]:
  • We identified 5 zip codes that score decently on all 3 of the metrics we defined (annual rent, roi_equity, and roi): 11217, 11215, 11201, 11231, 10036.
  • Zipcodes 10003 and 10025, which belong to Manhattan, have very high annual rent, but the 10-year ROI is quite low. This is because properties in Manhattan cost more and take longer to break even. However, an investor willing to wait out the longer break-even period can then collect a very high annual rent.
  • Zipcode 11231 in Brooklyn has a low annual rent compared to other neighborhoods, since median property costs in 11231 are lower. But its ROI is high, which makes it a favorable choice.
  • Overall, in my opinion, investing in properties located in the Brooklyn neighborhood would be profitable, as they cost at least 40% less than properties located in Manhattan and have a ~35% shorter break-even period as well.
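The 40% / 35% comparison can be checked with a simple groupby on the metrics table; the sketch below uses made-up stand-in numbers rather than the notebook's all_metrics.csv data, so only the calculation pattern is meaningful:

```python
import pandas as pd

# toy stand-in for the notebook's metrics table (values are illustrative)
dat = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan"],
    "cost_2020": [1_200_000, 1_400_000, 2_200_000, 2_400_000],
    "break_even_years": [13, 15, 20, 24],
})

# median cost and break-even per neighborhood, then Brooklyn's discount vs Manhattan
med = dat.groupby("neighbourhood_group_cleansed").median()
cost_discount = 1 - med.loc["Brooklyn", "cost_2020"] / med.loc["Manhattan", "cost_2020"]
be_discount = 1 - med.loc["Brooklyn", "break_even_years"] / med.loc["Manhattan", "break_even_years"]
print(round(cost_discount, 3), round(be_discount, 3))
```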

Next steps:

  • The model built to predict values for the square_feet column would have been more accurate if more training data had been available for that column.
  • Text analytics / sentiment analysis on the ignored description columns from the revenue data (Airbnb). This would surface other factors that drive customers to book an Airbnb property, such as access to public transport, parking space, etc.
  • Use the amenities column for better prediction of missing square footage. Additionally, use it to see how price is impacted by having more amenities; that impact should be used to scale rent to the same number of amenities for all properties.
  • Additional information, such as a community events calendar, could be used to analyze whether the number of community events in a zipcode attracts renters.
  • Due to time constraints, some coding practices such as memory management, variable nomenclature, and other markdown-specific functionality were skipped. Addressing these would be the automatic first step in future work.
In [ ]: